Predicting Quality of Portuguese Vinho Verde White Wine

Group 009E05

Dongchao Chen, Katie Barton, Kyden Wang, Mason Feng, Xin Mu

The University of Sydney

How to pick a good bottle of white wine?

Introduction

  • Sample: Portuguese Vinho Verde White Wine
  • Key measurement of “Good wine”: Quality Assessment
  • 11 predictor variables: physicochemical properties, e.g. density, alcohol, pH
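The data come from the UCI Machine Learning Repository's winequality-white.csv, a semicolon-separated file. A minimal parsing sketch using only the standard library; the two inline rows below are illustrative samples in that layout:

```python
import csv
from io import StringIO

# Two rows in the layout of the UCI winequality-white.csv file
# (semicolon-separated; the real file has 4898 rows).
SAMPLE = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.0;0.27;0.36;20.7;0.045;45.0;170.0;1.001;3.0;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14.0;132.0;0.994;3.3;0.49;9.5;6
"""

def load_rows(text):
    """Parse semicolon-separated wine records into dicts of floats."""
    reader = csv.DictReader(StringIO(text), delimiter=";")
    return [{k: float(v) for k, v in row.items()} for row in reader]

rows = load_rows(SAMPLE)
print(len(rows), rows[0]["alcohol"], int(rows[0]["quality"]))
```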

Quality distribution of white wine

Visualising all predictor variables against the dependent variable

Heat map
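The heat map presumably displays pairwise correlations among the predictors and quality. As a sketch of how such a correlation matrix is computed, on toy stand-ins for three of the variables (the names and values are illustrative, not the real data; alcohol and density are physically negatively related, so the toy data mimics that):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-ins for three wine variables (illustrative, not the real data).
alcohol = rng.normal(10.5, 1.2, size=200)
density = 1.0 - 0.005 * alcohol + rng.normal(scale=0.001, size=200)
quality = 0.3 * alcohol + rng.normal(scale=1.0, size=200)

# Each row is an observation, each column a variable.
corr = np.corrcoef(np.column_stack([alcohol, density, quality]), rowvar=False)
print(np.round(corr, 2))
```

A heat map is then just this matrix rendered with a colour scale.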

Model Selection

  • Stepwise Selection
  • LASSO Regression
  • Ordinal Logistic Regression

Performance Metrics

  • \(RMSE\)
  • \(MAE\)
  • \(R^2\)
  • \(AIC\)
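All four metrics are computable directly from the residuals; a small sketch (for a Gaussian linear model, the AIC here is up to an additive constant, with k counting the estimated parameters):

```python
import math

def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    """Coefficient of determination: 1 - RSS/TSS."""
    ybar = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, yhat))
    tss = sum((a - ybar) ** 2 for a in y)
    return 1 - rss / tss

def gaussian_aic(y, yhat, k):
    """AIC for a Gaussian linear model, up to a constant:
    n*log(RSS/n) + 2k, where k is the number of estimated parameters."""
    n = len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, yhat))
    return n * math.log(rss / n) + 2 * k

y, yhat = [3, 5, 6, 5], [3.5, 4.5, 6.0, 5.0]
print(rmse(y, yhat), mae(y, yhat), r2(y, yhat))
```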

Stepwise Selection

After 10-fold CV:

\[\begin{align} \widehat{\text{quality}} = &154.106 + 0.068(\text{fixed.acidity}) -\\ &1.888(\text{volatile.acidity}) + 0.083(\text{residual.sugar}) +\\ &0.003(\text{free.sulfur.dioxide}) - 154.291(\text{density}) +\\ &0.694(\text{pH}) + 0.629(\text{sulphates}) + 0.193(\text{alcohol})\\ \end{align}\\ \;\\\] \[\begin{array}{c|cccc} & \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\ \hline \textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\ \end{array} \;\\\]

Forward and backward selection chose the same model! But what about multicollinearity? We can check with the variance inflation factor (VIF).
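The stepwise fit was presumably run in R (e.g. with step()); as an illustration of the idea, here is a forward pass that greedily adds whichever predictor lowers the Gaussian AIC most, on toy data (all names and values are illustrative):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def aic(X, y):
    """Gaussian AIC (up to a constant) for an OLS fit with intercept."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    rss = float(np.sum((y - Xi @ beta) ** 2))
    n, k = len(y), Xi.shape[1]
    return n * math.log(rss / n) + 2 * k

# Toy data: only column 0 truly drives y; columns 1 and 2 are noise.
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

chosen, remaining = [], [0, 1, 2]
best = aic(X[:, []], y)                      # intercept-only model
while remaining:
    scores = {j: aic(X[:, chosen + [j]], y) for j in remaining}
    j, score = min(scores.items(), key=lambda kv: kv[1])
    if score >= best:                        # stop once AIC stops improving
        break
    best, chosen = score, chosen + [j]
    remaining.remove(j)

print("selected predictors:", chosen)
```

A full stepwise run would interleave backward-removal passes and, as on the slide, validate the chosen model with 10-fold CV.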

Dealing with multicollinearity

\[VIF_i = \frac{1}{1-R^2_i}\]

OK

\[\begin{array}{ccccc} \textrm{volatile.acidity} & \textrm{sulfur.dioxide} & \textrm{pH} & \textrm{sulphates} & \textrm{fixed.acidity}\\ \hline 1.057 & 1.149 & 2.114 & 1.130 & 2.580\\ \end{array}\]

Not OK

\[\begin{array}{ccc} \textrm{alcohol} & \textrm{residual.sugar} & \textrm{density}\\ \hline 7.623 & 11.854 & 26.123\\ \end{array}\]

Remove “density”! But what if there were a better way?
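The VIF check itself is easy to script: regress each predictor on all the others and apply \(1/(1-R^2_i)\). A numpy sketch with one deliberately collinear pair (toy data, not the wine predictors):

```python
import numpy as np

rng = np.random.default_rng(1)

def vif(X):
    """VIF of each column: regress it on the others, return 1/(1 - R^2)."""
    n, p = X.shape
    out = []
    for i in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, i], rcond=None)
        resid = X[:, i] - others @ beta
        tss = np.sum((X[:, i] - X[:, i].mean()) ** 2)
        r2 = 1.0 - resid @ resid / tss
        out.append(1.0 / (1.0 - r2))
    return out

x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                   # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

As on the slide, values near 1 are fine, while large values (common rules of thumb flag VIF above 5 or 10) mark predictors worth dropping or combining.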

Lasso Regression

\[\beta^{lasso}_\lambda = \underset{\beta}{\operatorname{arg\,min}} \Biggl\{ \underbrace{\sum_{i=1}^n\Biggl( y_i-\beta_0-\sum_{j=1}^p\beta_jx_{ij}\Biggr)^2}_{\text{Residual Sum of Squares}\; (RSS)}+\lambda\sum_{j=1}^p|\beta_j| \Biggr\}\]


Performing LASSO regression

10-fold CV gave us \(\log\lambda = -5.976\) or \(\lambda = 0.00254\). This gives:
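The fit was presumably done with glmnet-style software; as an illustration of what such solvers do internally, here is the coordinate-descent (soft-thresholding) update for the LASSO objective, on standardized toy data (the λ and data below are illustrative, not the wine fit):

```python
import numpy as np

rng = np.random.default_rng(2)

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    Assumes the columns of X are standardized (mean 0, variance 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam)     # shrink, possibly to 0
    return beta

n = 300
X = rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

beta_hat = lasso_cd(X, y, lam=0.1)
print(beta_hat)   # noise coefficients are shrunk toward or exactly to zero
```

The soft-threshold is what lets LASSO set coefficients exactly to zero, which is why it doubles as a variable-selection method.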

\[\begin{align} \widehat{\text{quality}} = &2.732 - 0.039(\text{fixed.acidity}) -\\ &1.751(\text{volatile.acidity}) + 0.016(\text{residual.sugar}) -\\ &0.003(\text{chlorides}) -0.548(\text{free.sulfur.dioxide}) +\\ &0.036(\text{pH}) + 0.202(\text{sulphates}) +\\ &0.335(\text{alcohol})\\ \end{align}\\ \;\\\] \[\begin{array}{c|cccc} & \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\ \hline \textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\ \textrm{LASSO} & 0.751 & 0.584 & 0.281 & 11173.49\\ \end{array}\]

Nice! RMSE and \(R^2\) improved slightly (at a marginally higher AIC). However, as our response variable is discrete and ordered, it may make more sense to use a different type of regression.

Ordinal Logit Regression

Our dependent variable “quality” is an ordinal variable, so we model the log-odds, also known as the logit, of its cumulative probabilities. Let’s say we have \(J\) ordered categories and consider, for each \(j = 1,\dots,J-1\):

\[ P(Y\leq j) \]

\[\log\Biggl(\frac{P(Y\leq j)}{P(Y> j)}\Biggr) = \text{logit}(P(Y\leq j))\] \[\text{logit}(P(Y\leq j)) = \alpha_j - \sum_{k=1}^{p}\beta_kx_k\]


Ordinal Logit Regression

\[\begin{align} \text{logit}(P(\widehat{\text{quality}}\leq i)) = &\alpha_i -(- 0.140(\text{fixed.acidity}) -\\ &5.336(\text{volatile.acidity}) + 0.060(\text{residual.sugar}) -\\ &2.809(\text{chlorides}) + 0.011(\text{free.sulfur.dioxide}) +\\ &0.423(\text{pH}) + 1.090(\text{sulphates}) + 0.972(\text{alcohol}))\\ \end{align}\\ \;\\ \alpha_i = (4.035, 6.378, 9.404, 11.959, 14.198, 17.874)\]

Where \(i\) denotes the boundary between quality 3 & 4, quality 4 & 5, quality 5 & 6, etc. So \(P(\widehat{\text{quality}}\leq 5)\) uses the 5 & 6 cutpoint \(\alpha = 9.404\), and \(P(\widehat{\text{quality}}\leq 6)\) uses the 6 & 7 cutpoint \(\alpha = 11.959\).

Ordinal Logit Regression Example

For a wine with fixed.acidity = 7, volatile.acidity = 0.27, residual.sugar = 20.7, chlorides = 0.045, free.sulfur.dioxide = 45, pH = 3, sulphates = 0.45 and alcohol = 8.8, suppose we want the probability that its quality is 6 or less. Using the 6 & 7 cutpoint \(\alpha = 11.959\):

\[\begin{align} \text{logit}(P(\widehat{\text{quality}}\leq 6)) = &11.959 -(- 0.140(\text{fixed.acidity}) -\\ &5.336(\text{volatile.acidity}) + 0.060(\text{residual.sugar}) -\\ &2.809(\text{chlorides}) + 0.011(\text{free.sulfur.dioxide}) +\\ &0.423(\text{pH}) + 1.090(\text{sulphates}) + 0.972(\text{alcohol}))\\ = &2.456\\ P(\widehat{\text{quality}}\leq 6) = &\frac{e^{2.456}}{1+e^{2.456}} = 0.921 \end{align}\\\]

If we want the probability that the wine is exactly quality 6, we subtract the adjacent cumulative probability, where \(P(\widehat{\text{quality}}\leq 5)\) uses the 5 & 6 cutpoint \(9.404\), giving logit \(-0.099\) and probability \(0.475\):

\[\begin{align} P(\widehat{\text{quality}} = 6) = & P(\widehat{\text{quality}}\leq 6) - P(\widehat{\text{quality}}\leq 5)\\ = & 0.921 - 0.475 = 0.446 \end{align}\\\]

So this particular wine has a \(44.6\%\) probability of being of quality 6.
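The worked example can be checked numerically with the inverse logit \(p = e^x/(1+e^x)\), plugging in the slide's coefficients, the 6 & 7 cutpoint \(11.959\) and the 5 & 6 cutpoint \(9.404\):

```python
import math

def inv_logit(x):
    """Inverse of the logit: maps a log-odds back to a probability."""
    return math.exp(x) / (1.0 + math.exp(x))

# Coefficients of the fitted ordinal model, as reported on the slide.
beta = {"fixed.acidity": -0.140, "volatile.acidity": -5.336,
        "residual.sugar": 0.060, "chlorides": -2.809,
        "free.sulfur.dioxide": 0.011, "pH": 0.423,
        "sulphates": 1.090, "alcohol": 0.972}
# The example wine from the slide.
wine = {"fixed.acidity": 7, "volatile.acidity": 0.27,
        "residual.sugar": 20.7, "chlorides": 0.045,
        "free.sulfur.dioxide": 45, "pH": 3,
        "sulphates": 0.45, "alcohol": 8.8}

eta = sum(beta[k] * wine[k] for k in beta)   # linear predictor
p_le6 = inv_logit(11.959 - eta)              # cutpoint between quality 6 and 7
p_le5 = inv_logit(9.404 - eta)               # cutpoint between quality 5 and 6
print(round(eta, 3), round(p_le6, 3), round(p_le5, 3), round(p_le6 - p_le5, 3))
```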

Ordinal Logit Regression

Confusion matrix (rows: observed quality; columns: predicted quality):

\[\begin{array}{c|ccccccc} & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ \hline 3 & 0 & 1 & 7 & 11 & 1 & 0 & 0\\ 4 & 0 & 4 & 96 & 62 & 1 & 0 & 0\\ 5 & 0 & 2 & 723 & 724 & 8 & 0 & 0\\ 6 & 0 & 0 & 377 & 1678 & 143 & 0 & 0\\ 7 & 0 & 0 & 52 & 642 & 186 & 0 & 0\\ 8 & 0 & 0 & 1 & 117 & 57 & 0 & 0\\ 9 & 0 & 0 & 0 & 3 & 2 & 0 & 0\\ \end{array}\\\] \[\begin{array}{c|cccc} & \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\ \hline \textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\ \textrm{LASSO} & 0.751 & 0.584 & 0.281 & 11173.49\\ \textrm{Ordinal} & - & - & 0.312^* & 11001.58\\ \end{array}\]
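Exact-match accuracy is the diagonal fraction of this matrix; reading the rows above as observed quality 3–9:

```python
# Rows: observed quality 3..9; columns: predicted quality 3..9
# (values copied from the confusion matrix above).
conf = [
    [0, 1,   7,   11,   1,  0, 0],
    [0, 4,  96,   62,   1,  0, 0],
    [0, 2, 723,  724,   8,  0, 0],
    [0, 0, 377, 1678, 143,  0, 0],
    [0, 0,  52,  642, 186,  0, 0],
    [0, 0,   1,  117,  57,  0, 0],
    [0, 0,   0,    3,   2,  0, 0],
]
correct = sum(conf[i][i] for i in range(len(conf)))
total = sum(sum(row) for row in conf)
print(correct, total, round(correct / total, 3))
```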

Conclusions

Best model: Ordinal Logit Regression Model

\[\begin{array}{c|cccc} & \textrm{RMSE} & \textrm{MAE} & R^2 & \textrm{AIC}\\ \hline \textrm{Stepwise Select} & 0.753 & 0.585 & 0.278 & 11171.41\\ \textrm{LASSO} & 0.751 & 0.584 & 0.281 & 11173.49\\ \textrm{Ordinal} & - & - & 0.312^* & 11001.58\\ \end{array}\]

OLR predicted the exact quality correctly about 53% of the time (2591 of 4898 wines).

\(\;\)

Trade-off: better accuracy at the cost of interpretability.

Conclusions

For consumers: Nutrition facts can be helpful for wine choice.

For companies: Raise prices for higher perceived quality and increased profits.
